Split band linear prediction vocoder with pitch extraction

ABSTRACT

A speech coder includes an encoder using an analysis and synthesis approach. The encoder uses a pitch determination algorithm requiring analysis in both the frequency domain and the time domain, a voicing determination algorithm and an algorithm for determining spectral amplitudes and means for quantising the values determined. A decoder is also described.

This invention relates to speech coders.

The invention finds particular, though not exclusive, application in telecommunications systems.

According to one aspect of the invention there is provided a speech coder including an encoder for encoding an input speech signal divided into frames each consisting of a predetermined number of digital samples, the encoder including: linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame; pitch determination means for determining at least one value of pitch for each frame, the pitch determination means including first estimation means for analysing samples using a frequency domain technique (frequency domain analysis), second estimation means for analysing samples using a time domain technique (time domain analysis) and pitch evaluation means for using the results of said frequency domain and time domain analyses to derive a said value of pitch; voicing means for defining a measure of voiced and unvoiced signals in each frame; amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of linear prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein said first estimation means generates a first measure of pitch for each of a number of candidate pitch values, the second estimation means generates a respective second measure of pitch for each of said candidate pitch values and said evaluation means combines each of at least some of the first measures with the corresponding said second measure and selects one of the candidate pitch values by reference to the resultant combinations.

According to another aspect of the invention there is provided a speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of linear prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein said pitch determination means includes pitch estimation means for determining an estimate of the value of pitch and pitch refinement means for deriving the value of pitch from the estimate, the pitch refinement means defining a set of candidate pitch values including fractional values distributed about said estimate of the value of pitch determined by the pitch estimation means, identifying peaks in a frequency spectrum of the frame, for each said candidate pitch value correlating said peaks with amplitudes at different harmonic frequencies (kω_(o)) of a frequency spectrum of the frame, where $\omega_{o} = \frac{2\pi}{P}$,

P is a said candidate pitch value and k is an integer, and selecting as a said value of pitch the candidate pitch value giving the maximum correlation.

According to a further aspect of the invention there is provided a speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames, each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for determining for each frame a voicing cut-off frequency for separating a frequency spectrum of the frame into a voiced part and an unvoiced part without evaluating the voiced/unvoiced status of individual harmonic frequency bands, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of coefficients, said value of pitch, said voicing cut-off frequency and said amplitude information to generate a set of quantisation indices for each frame.

According to a yet further aspect of the invention there is provided a speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein the amplitude determination means generates, for each frame, a set of spectral amplitudes for frequency bands centred on frequencies harmonically related to the value of pitch determined by the pitch determination means, and the quantisation means quantises the normalised spectral amplitudes to generate a first part of an amplitude quantisation index.

According to a yet further aspect of the invention there is provided a speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding means for analysing samples to generate a respective set of Line Spectral Frequency (LSF) coefficients for a leading part and for a trailing part of each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said sets of LSF coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices, wherein said quantisation means defines a set of quantised LSF coefficients (LSF′2) for the leading part of the current frame by the expression

LSF′2 = αLSF′1 + (1−α)LSF′3,

where LSF′3 and LSF′1 are respectively sets of quantised LSF coefficients for the trailing parts of the current frame and the frame immediately preceding the current frame, and α is a vector in a first vector quantisation codebook, defines each said set of quantised LSF coefficients LSF′2, LSF′3 for the leading and trailing parts respectively of the current frame as a combination of respective LSF quantisation vectors Q2, Q3 of a second vector quantisation codebook and respective prediction values P2, P3, where P2 = λQ1 and P3 = λQ2, λ is a constant and Q1 is a said LSF quantisation vector for the trailing part of said immediately preceding frame, and selects said vector Q3 and said vector α from the first and second vector quantisation codebooks respectively to minimise a measure of distortion between the LSF coefficients generated by the linear predictive coding means (LSF2, LSF3) for the current frame and the corresponding quantised LSF coefficients (LSF′2, LSF′3).

According to yet a further aspect of the invention there is provided a speech coder for decoding a set of quantisation indices representing LSF coefficients, pitch value, a measure of voiced and unvoiced signals and amplitude information, including processor means for deriving an excitation signal from said indices representing pitch value, measure of voiced and unvoiced signals and amplitude information, a LPC synthesis filter for filtering the excitation signal in response to said LSF coefficients, means for comparing pitch cycle energy at the LPC synthesis filter output with corresponding pitch cycle energy in the excitation signal, means for modifying the excitation signal to reduce a difference between the compared pitch cycle energies and a further LPC synthesis filter for filtering the modified excitation signal.

Embodiments according to the invention are now described, by way of example only, with reference to the accompanying drawings in which:

FIG. 1 is a generalised representation of a speech coder;

FIG. 2 is a block diagram showing the encoder of a speech coder according to the invention;

FIG. 3 shows a waveform of an analogue input speech signal;

FIG. 4 is a block diagram showing a pitch detection algorithm used in the encoder of FIG. 2;

FIG. 5 illustrates the determination of voicing cut-off frequency;

FIG. 6(a) shows an LPC Spectrum for a frame;

FIG. 6(b) shows spectral amplitudes derived from the LPC spectrum of FIG. 6(a);

FIG. 6(c) shows a quantisation vector derived from the spectral amplitudes of FIG. 6(b);

FIG. 7 shows the decoder of the speech coder;

FIG. 8 illustrates an energy-dependent interpolation factor for the LSF coefficients; and

FIG. 9 illustrates a perceptually-enhanced LPC spectrum used to weight the dequantised spectral amplitudes.

It will be appreciated that the encoders and decoders described hereinafter with reference to the drawings are implemented algorithmically, as software instructions carried out in a suitably designated signal processor. The blocks shown in the drawings are intended to facilitate explanation of the function of each processing step carried out by the processor, rather than to represent discrete hardware components in the speech coder. Alternatively, of course, the encoders and decoders could be implemented using hardware components.

FIG. 1 is a generalised representation of a speech coder, comprising an encoder 1 and a decoder 2. In use, an analogue input speech signal S_(i)(t) is received at the encoder 1 where it is sampled, typically at a sampling frequency of 8 kHz. The sampled speech signal is then divided into frames and each frame is encoded to produce a set of quantisation indices which represent the waveform of the input speech signal, but contain relatively few bits. The quantisation indices for successive frames are transmitted to the decoder 2 over a communications channel 3, and the decoder 2 processes the received quantisation indices to synthesise an analogue output speech signal S_(O)(t) corresponding to the original input speech signal. In the case of a telecommunications link using a speech coder, the speech channel requires an encoder at the speech signal input end and a decoder at the reception end. Therefore, the speech coder associated with one end of the telecommunications link requires both an encoder and a decoder which may be connected to separate channels in the case of a duplex link or the same channel in the case of a simplex link.

FIG. 2 shows the encoder of one embodiment of a speech coder according to the invention, referred to hereinafter as a Split-Band LPC (SB-LPC) speech coder. The speech coder uses an Analysis and Synthesis scheme.

The described speech coder is designed to operate at a bit rate of 2.4 kb/s; however, lower and higher bit rates are possible (for example, bit rates in the range from 1.2 kb/s to 6.8 kb/s) depending on the level of quantisation used and the rate at which the quantisation indices are updated.

Initially, the analogue input speech signal is low pass filtered to remove frequencies outside the human voice range. The low pass filtered signal is then sampled at a sampling frequency of 8 kHz. The resultant digital signal d_(i)(t) is then preconditioned by passing the signal through a high-pass filter 10 which, in this particular implementation, has a transfer function H(z) of the form
$$H(z) = \frac{1 - z^{-1}}{1 - 0.9183z^{-1}}.$$

The effect of the high-pass filter 10 is to remove any DC level that might be present.

The preconditioned digital signal is then passed through a Hamming window 11 which is effective to divide the signal into frames. In this example, each frame is 160 samples long, corresponding to a frame update time interval of 20 ms. The coefficients W_(Hamm)(i) of the Hamming window 11 are defined as
$$W_{Hamm}(i) = 0.54 - 0.46\cos\left(\frac{2\pi i}{159}\right) \quad \text{for } 0 \leq i \leq 159.$$
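By way of illustration only, the preconditioning and windowing steps might be sketched in Python as follows; the function names are invented for clarity, and this is a minimal sketch rather than the patent's reference implementation:

```python
import numpy as np

def precondition(samples):
    """High-pass filter H(z) = (1 - z^-1)/(1 - 0.9183 z^-1), removing DC.

    Implemented as the difference equation
    y[n] = x[n] - x[n-1] + 0.9183 * y[n-1].
    """
    y = np.zeros(len(samples))
    prev_x = prev_y = 0.0
    for n, x in enumerate(samples):
        y[n] = x - prev_x + 0.9183 * prev_y
        prev_x, prev_y = x, y[n]
    return y

def hamming_window(n=160):
    """W_Hamm(i) = 0.54 - 0.46*cos(2*pi*i/159) for 0 <= i <= 159."""
    i = np.arange(n)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * i / (n - 1))

# Example: one 20 ms frame (160 samples at 8 kHz) of a 100 Hz tone.
fs = 8000
t = np.arange(160) / fs
frame = precondition(np.sin(2.0 * np.pi * 100.0 * t)) * hamming_window()
```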

The frequency spectrum of each frame is then modelled on the output of a linear time-varying filter, more specifically an all-pole linear predictive (LPC) filter 12 having a preset number L of LPC coefficients which are obtained using the known Levinson-Durbin algorithm. The LPC filter 12 attempts to establish a linear relationship between each input sample in the current frame and the L preceding samples. Therefore, if the i^(th) input sample is represented as a_(i) and the LPC coefficients are represented as LPC(j), then the values of LPC(j) are chosen to minimise the expression:
$$\varepsilon = \sum_{i=0}^{N}\left[a_{i} - \sum_{j=1}^{L} LPC(j-1)\,a_{i-j}\right]^{2}$$

where, in this example, N=160 and L=10.
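The patent names the Levinson-Durbin algorithm without reproducing it; the following is a minimal sketch of the standard autocorrelation-method recursion, with refinements such as lag windowing omitted:

```python
import numpy as np

def lpc_levinson_durbin(frame, order=10):
    """LPC coefficients LPC(0)..LPC(order-1) for one frame, found by the
    standard autocorrelation-method Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:n - lag], frame[lag:])
                  for lag in range(order + 1)])
    a = np.zeros(order)        # predictor coefficients built up per stage
    err = r[0]                 # prediction-error energy
    for i in range(order):
        # Reflection coefficient for stage i.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_prev = a[:i].copy()
        a[i] = k
        a[:i] = a_prev - k * a_prev[::-1]
        err *= 1.0 - k * k
    return a, err

# Example: coefficients of a decaying 200 Hz tone frame.
t = np.arange(160) / 8000.0
coeffs, residual_energy = lpc_levinson_durbin(
    np.sin(2.0 * np.pi * 200.0 * t) * np.exp(-20.0 * t))
```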

The LPC coefficients LPC(0), LPC(1) . . . LPC(9) are then transformed to generate corresponding Line Spectral Frequency (LSF) coefficients LSF(0), LSF(1) . . . LSF(9) for the frame. This is carried out in LPC-LSF transformer 13 using a known root search method.

The LSF coefficients are then passed to a vector quantiser 14 where they undergo a vector quantisation process to generate an LSF quantisation index L for the frame which is routed to a first output O₁ of the encoder. Alternatively, the LSF coefficients could be quantised using scalar quantisers.

As is known, LSF coefficients are always monotonic and this makes the quantisation process easier than would be the case using LPC coefficients. Furthermore, the LSF coefficients facilitate frame-to-frame interpolation, a process needed in the decoder.

The vector quantisation process takes account of the relative frequencies of the LSF coefficients in such a way as to give greater weight to coefficients which are relatively close in frequency and therefore representative of a significant peak in the frequency spectrum of the input speech signal.

In this particular implementation of the invention, the LSF coefficients are quantised using a total of 24 bits. The coefficients LSF(0), LSF(1), LSF(2) form a first group G₁ which is quantised using 8 bits, coefficients LSF(3), LSF(4), LSF(5) form a second group G₂ which is quantised using 8 bits and coefficients LSF(6), LSF(7), LSF(8), LSF(9) form a third group G₃ which is also quantised using 8 bits.

Each group of LSF coefficients is quantised separately. By way of illustration, the quantisation process will be described in detail with reference to group G₁; however, substantially the same process is also used for groups G₂ and G₃.

The vector quantisation process is carried out using a codebook containing 2⁸ entries, numbered 1 to 256, the r^(th) entry in the codebook consisting of a vector V_(r) of three elements V_(r)(0), V_(r)(1), V_(r)(2) corresponding to the coefficients LSF(0), LSF(1), LSF(2) respectively. The aim of the quantisation process is to select a vector V_(r) which best matches the actual LSF coefficients.

For each entry in the codebook, the vector quantiser 14 forms the summation
$$\sum_{i=0}^{2}\left[\left(V_{r}(i) - LSF(i)\right)W(i)\right]^{2},$$

where W(i) is a weighting factor, and the entry giving the minimum summation defines the 8 bit quantisation index for the LSF coefficients in group G₁.

The effect of the weighting factor is to emphasise the importance in the above summations of the more significant peaks for which the LSF coefficients are relatively close.
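A minimal sketch of this weighted codebook search follows; the text does not give the form of W(i), so the inverse-gap weighting in the example is an assumption chosen to emphasise closely spaced coefficients:

```python
import numpy as np

def quantise_lsf_group(lsf_group, codebook, weights):
    """Index of the codebook entry V_r minimising
    sum_i [ (V_r(i) - LSF(i)) * W(i) ]^2."""
    diffs = (codebook - lsf_group) * weights   # broadcast over all entries
    return int(np.argmin(np.sum(diffs ** 2, axis=1)))

# Example: group G1 (three coefficients) against a random 256-entry book.
rng = np.random.default_rng(0)
codebook_g1 = np.sort(rng.uniform(0.0, np.pi, size=(256, 3)), axis=1)
lsf_g1 = np.array([0.35, 0.48, 0.90])
# Hypothetical weighting: a coefficient close to a neighbour gets more weight.
gaps = np.diff(np.concatenate(([0.0], lsf_g1, [np.pi])))
w = 1.0 / np.minimum(gaps[:-1], gaps[1:])
index_g1 = quantise_lsf_group(lsf_g1, codebook_g1, w)   # 8-bit index
```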

The RMS energy E_(o) of the 160 samples in the current frame n is calculated in background signal estimation block 15 and this value is used to update the value of a background energy estimate E_(BG)^(n) according to the following criteria:
$$E_{BG}^{n} = \begin{cases} E_{BG}^{n-1}/1.03 & \text{if } E_{0} < E_{BG}^{n-1}/1.03 \\ E_{BG}^{n-1} \times 1.01 & \text{if } E_{0} > E_{BG}^{n-1} \times 1.01 \\ E_{0} & \text{if } E_{BG}^{n-1}/1.03 \leq E_{0} \leq E_{BG}^{n-1} \times 1.01 \end{cases}$$

where E_(BG)^(n−1) is the background energy estimate for the immediately preceding frame, n−1.

If E_(BG) ^(n) is less than 1, then E_(BG) ^(n) is set at 1.

The values of E_(BG)^(n) and E_(o) are then used to update the values of NRGS and NRGB, which represent the expected values of the RMS energy of the speech and background components respectively of the input signal, according to the following criteria:
$$NRGB^{n} = \begin{cases} NRGB^{n-1} & \text{if } E_{o} > 1.5\,E_{BG}^{n} \\ 0.5\left(NRGB^{n-1} + E_{o}\right) & \text{if } E_{o} \leq 1.5\,E_{BG}^{n} \text{ and } E_{o} \leq NRGB^{n-1} \\ 0.97\,NRGB^{n-1} + 0.03\,E_{o} & \text{if } E_{o} \leq 1.5\,E_{BG}^{n} \text{ and } E_{o} > NRGB^{n-1} \end{cases}$$

and if NRGB^(n) < 0.05 then NRGB^(n) is set at 0.05, and
$$NRGS^{n} = \begin{cases} NRGS^{n-1} & \text{if } E_{o} \leq 2.0\,E_{BG}^{n} \\ 0.5\left(NRGS^{n-1} + E_{o}\right) & \text{if } E_{o} > 2.0\,E_{BG}^{n} \text{ and } E_{o} > NRGS^{n-1} \\ 0.99\,NRGS^{n-1} + 0.01\,E_{o} & \text{if } E_{o} > 2.0\,E_{BG}^{n} \text{ and } E_{o} \leq NRGS^{n-1} \end{cases}$$

and if NRGS^(n) < 2.0, then NRGS^(n) is set at 2.0, and if NRGB^(n) > NRGS^(n) then NRGS^(n) is set to NRGB^(n).
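The three update rules can be collected into a single per-frame routine; this sketch is a direct transcription of the criteria above, with invented function and variable names:

```python
def update_energy_trackers(e0, e_bg, nrgb, nrgs):
    """One-frame update of the background estimate E_BG and the expected
    background/speech RMS energies NRGB/NRGS, per the rules above."""
    # Background energy estimate E_BG.
    if e0 < e_bg / 1.03:
        e_bg = e_bg / 1.03
    elif e0 > e_bg * 1.01:
        e_bg = e_bg * 1.01
    else:
        e_bg = e0
    e_bg = max(e_bg, 1.0)

    # Background RMS energy NRGB (updated only in low-energy frames).
    if e0 <= 1.5 * e_bg:
        if e0 <= nrgb:
            nrgb = 0.5 * (nrgb + e0)
        else:
            nrgb = 0.97 * nrgb + 0.03 * e0
    nrgb = max(nrgb, 0.05)

    # Speech RMS energy NRGS (updated only in high-energy frames).
    if e0 > 2.0 * e_bg:
        if e0 > nrgs:
            nrgs = 0.5 * (nrgs + e0)
        else:
            nrgs = 0.99 * nrgs + 0.01 * e0
    nrgs = max(nrgs, 2.0, nrgb)
    return e_bg, nrgb, nrgs
```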

By way of illustration, FIG. 3 depicts the waveform of an analogue input speech signal S_(i)(t) contained within the interval (20 ms long) of the current frame F₀. The waveform exhibits relatively large amplitude pitch pulses P_(u) which are an important characteristic of human speech. The pitch or pitch period P for the frame is defined as the time interval between consecutive pitch pulses in the frame and this can be expressed in terms of the number of samples contained within that time interval. The pitch period P is inversely related to the fundamental pitch frequency ω_(o), where $\omega_{o} = \frac{2\pi}{P}$.

For speech sampled at 8 kHz it is reasonable to consider a pitch period of from 15 to 150 samples, corresponding to a fundamental pitch frequency in the range from about 50 Hz to 535 Hz. The fundamental pitch frequency ω_(o) will, of course, be accompanied by a number of harmonic frequencies.

As already explained, pitch period P is an important characteristic of the speech signal and therefore forms the basis of another quantisation index P which is routed to a second output O₂ of the encoder. Furthermore, as will become clear, the pitch period P is central to the determination of other quantisation indices produced by the encoder. Therefore, considerable care is taken to evaluate the pitch period P with the required precision and in as reliable a manner as possible. To this end, a pitch detector 16 subjects each frame to analysis both in the frequency domain and in the time domain using a pitch detection algorithm which is now described in detail with reference to FIG. 4.

To facilitate analysis in the frequency domain, a discrete Fourier transform is performed in DFT block 17 using a 512 point fast Fourier transform (FFT) algorithm. Samples are supplied to the DFT block 17 via a 221 point Kaiser window 18 centred on the current frame and the samples are padded with zeros to bring their number to 512.

Referring to FIG. 4, the magnitudes M(i) of the resultant frequency spectrum are calculated in block 401 using the real and imaginary components SWR(i) and SWI(i) of the transform, and in order to reduce complexity this is done at each frequency i up to a predetermined cut-off frequency (Cut), where i is expressed in terms of the output samples of the FFT running from 0 to 255. In this embodiment, the cut-off frequency is at i = 90, corresponding to 1.5 kHz, which far exceeds the maximum expected fundamental pitch frequency.

The magnitudes M(i) are calculated as

M(i) = (SWR(i)² + SWI(i)²)^(½) for 0 ≤ i ≤ Cut−1

and the RMS value of M(i), M_(max), is calculated in block 402 as
$$M_{\max} = \left[\frac{1}{Cut}\sum_{i=0}^{Cut-1}\left(M(i)\right)^{2}\right]^{1/2}$$

In order to improve the performance of the pitch estimation algorithm, the magnitudes M(i) are preprocessed in blocks 404 to 407.

Initially, in block 404, a bias is applied in order to de-emphasise the main peaks in the frequency spectrum. If any magnitude M(i) exceeds M_(max) it is replaced by a new magnitude given by (M(i)M_(max))^(½). A further bias is then applied to emphasise the lower frequencies, which are more important in terms of their speech content, and, to this end, each magnitude is weighted by the factor $\left(1 - \frac{i}{Cut + 5}\right)$.
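A sketch of the magnitude computation and the two biases (blocks 401, 402 and 404) follows; it assumes the frame has already been Kaiser-windowed, which is omitted here for brevity:

```python
import numpy as np

def biased_magnitudes(windowed_frame, fft_size=512, cut=90):
    """Magnitudes M(i) for 0 <= i <= Cut-1 with the two biases applied:
    de-emphasis of peaks above the RMS level, then low-frequency emphasis."""
    spectrum = np.fft.fft(windowed_frame, fft_size)   # zero-padded FFT
    m = np.abs(spectrum[:cut])                        # M(i) = sqrt(SWR^2+SWI^2)
    m_max = np.sqrt(np.mean(m ** 2))                  # RMS value of M(i)
    m = np.where(m > m_max, np.sqrt(m * m_max), m)    # de-emphasise main peaks
    return m * (1.0 - np.arange(cut) / (cut + 5))     # emphasise low freqs
```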

To improve performance against background noise, a noise cancellation algorithm is applied to the weighted magnitudes in block 405. To this end, each magnitude M(i) is tracked during non-speech frames to obtain an estimate M_(mem)(i) of background noise. If E_(O) < 1.5 E_(BG)^(n) the value of M_(mem)(i) is up-dated to produce a new value M′_(mem)(i) given by:

M′_(mem)(i) = 0.9 M_(mem)(i) + 0.1 M(i)

If the ratio $\frac{{NRGS}^{n}}{{NRGB}^{n}}$

is less than a threshold value (typically in the range from 5 to 20) and no update of M_(mem) has taken place for the current frame, indicating that the frame contains significant background noise in addition to speech, then the value kM′_(mem)(i) (where k is a constant, typically 0.9) is subtracted from M(i) for each frequency i in the frequency spectrum in order to reduce the effect of the background noise. If the difference is negative or close to zero (less than a threshold value, say 0.0001), then M(i) is set at the threshold value.

The resultant magnitudes M′(i) are then analysed in block 406 to detect for peaks. This is done by comparing each magnitude M′(i) (apart from those at the extremes of the frequency range) with its immediate neighbours M′(i−1) and M′(i+1), and if it is higher than both it is declared a peak. For each peak so detected its magnitude is stored as amp_(pk)(l) and its frequency is stored as freq_(pk)(l), where l is the number of the peak.

A smoothing algorithm is then applied to the magnitudes M′(i) in block 407 to generate a relatively smooth envelope for the frequency spectrum. The smoothing algorithm is carried out in two stages. In the first stage, a variable x is initialised at zero and is compared with the magnitude M′(i) at each value of i starting at zero and finishing at Cut−1. If x is less than M′(i), x is set to that value; otherwise, the value of M′(i) is set to x, and x is multiplied by an envelope decay factor, 0.85 in this example. The same procedure is then carried out again, but in the opposite direction, i.e. for values of i starting at Cut−1 and finishing at zero.

The effect of this process is to generate a set of magnitudes a(i) for 0 ≤ i ≤ Cut−1 representing a smoothed, exponentially decaying envelope of the frequency spectrum; in particular, the process is effective to eliminate relatively small peaks residing next to larger peaks.
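A sketch of the two-pass smoothing follows; the text leaves the exact point at which x is decayed slightly open, so this sketch decays it on every step and re-initialises x for the reverse pass:

```python
import numpy as np

def smooth_envelope(m, decay=0.85):
    """Smoothed, exponentially decaying envelope a(i) of the magnitudes
    M'(i): one forward pass, then the same procedure in reverse."""
    a = np.array(m, dtype=float)
    for indices in (range(len(a)), range(len(a) - 1, -1, -1)):
        x = 0.0
        for i in indices:
            if x < a[i]:
                x = a[i]       # climb onto each new peak
            else:
                a[i] = x       # fill troughs with the decayed envelope
            x *= decay         # envelope decays away from the last peak
    return a
```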

It will be apparent that the peak-detection process carried out in block 406 will identify any peak, even small ones. In order to reduce the amount of processing in subsequent stages of the algorithm, a peak is discarded by block 408 if its magnitude amp_(pk) is less than a factor c times the magnitude a(i) at the same frequency. In this example, c is set at 0.5.

The magnitude values a(i) generated in block 407, and the remaining amplitude and frequency values amp_(pk) and freq_(pk) generated in blocks 406 and 408, are used in block 409 to evaluate a first estimate of the pitch period.

To this end, a function Met1 is evaluated for each candidate pitch period P in the range from 15 to 150. To reduce complexity this may be done using steps of 0.5 up to the value 75, and steps of unity thereafter. Met1 is evaluated using the expression:
$$Met1(\omega_{o}) = \sum_{k=1}^{K(\omega_{o})} a(k\omega_{o})\,e(k,\omega_{o}) - \frac{1}{2}\sum_{k=1}^{K(\omega_{o})}\left(a(k\omega_{o})\right)^{2} \quad \rightarrow EQ\ 1,$$

where $e(k,\omega_{o}) = \max_{l}\left(amp_{pk}(l)\,D(freq_{pk}(l) - k\omega_{o})\right)$, $\omega_{o} = \frac{2\pi}{P}$,

K(ω_(o)) is the number of harmonics below the cut-off frequency, and D(freq_(pk)(l)−kω_(o)) = sinc(freq_(pk)(l)−kω_(o)).

In effect, this expression can be thought of as the cross-correlation function between the frequency response of a comb filter defined by the harmonic amplitudes a(kω_(o)) of the pitch candidate P and the optimum peak amplitudes e(kω_(o)). The function D(freq_(pk)(l)−kω_(o)) is a distance measure related to the frequency separation between the l^(th) peak in the frequency spectrum and the k^(th) harmonic frequency of the pitch candidate P within a specified search distance. As e(kω_(o)) depends on both the distance measure and on peak amplitude, it is possible that the optimum value e(kω_(o)) might not correspond to the minimum separation between the harmonic frequency kω_(o) and the frequencies of the peaks.
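A sketch of the Met1 computation (EQ 1) follows; the units (peak frequencies in radians per sample), the sinc normalisation and the omission of the search-distance limit are all assumptions:

```python
import numpy as np

def met1(p, a, peak_amps, peak_freqs, fft_size=512, cut=90):
    """First pitch measure (EQ 1) for candidate pitch period p.

    `a` is the smoothed envelope over FFT bins 0..cut-1; peak amplitudes
    and frequencies (radians/sample) are NumPy arrays."""
    w0 = 2.0 * np.pi / p                      # candidate fundamental
    bins_per_rad = fft_size / (2.0 * np.pi)   # FFT bins per radian/sample
    total, k = 0.0, 1
    while k * w0 * bins_per_rad < cut:        # harmonics below the cut-off
        hk = k * w0
        # e(k, w0): best peak amplitude weighted by the sinc distance
        # measure D (np.sinc(x) is sin(pi x)/(pi x), hence the /pi).
        e = np.max(peak_amps * np.sinc((peak_freqs - hk) / np.pi))
        ak = a[int(round(hk * bins_per_rad))]
        total += ak * e - 0.5 * ak * ak
        k += 1
    return total
```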

Having evaluated Met1(ω_(o)) for each pitch candidate P, the values obtained are multiplied by a weighting factor
$$b1 = \left(1 - 0.1\frac{P}{150}\right)$$

so as to bias the values slightly in favour of the smaller pitch candidates.

The higher the value of Met1(ω_(o)), the greater the likelihood that the corresponding pitch candidate is the actual pitch value. Moreover, if the pitch candidate is twice the actual pitch value (i.e. pitch doubling) the value of Met1(ω_(o)) will be small; as will be described, this leads to the elimination of these unwanted pitch candidates at a later stage in the processing.

In order to identify the most promising pitch candidates, peak values of Met1(ω_(o)) are detected in block 410. This is done by processing the values of Met1(ω_(o)) generated in block 409 to detect for a maximum in each of five contiguous ranges of pitch, i.e. in pitch ranges 15 to 27.5, 28 to 49.5, 50 to 94.5, 95 to 124.5, 125 to 150, and a maximum value within the range ±5 of a tracked pitch trP (to be described later). The five contiguous pitch ranges are so selected as to eliminate the possibility of pitch doubling or pitch halving within each range; that is, a peak detected in a range cannot have twice or half of the pitch of any other peak in the same range. By this means, six peak values Met1(1), Met1(2), Met1(3), Met1(4), Met1(5), Met1(6) are retained for further processing along with their respective pitch values P₁, P₂, P₃, P₄, P₅, P₆. Although the value of ω_(o) which maximises Met1(ω_(o)) provides a reasonable estimation of pitch value, it is sometimes susceptible to error; in particular, it might sometimes identify a pitch value which is half the actual pitch value (i.e. a pitch halving).

To alleviate this problem, a second estimate of pitch is evaluated in block 411 for each of the six candidate pitch values P₁, P₂, P₃, P₄, P₅, P₆ derived from the first estimate.

The second estimate is evaluated using a time-domain analysis technique by forming different summations of the absolute values |d(i)| of the input samples over a single pitch period P. To that end, the summation
$$f(k,P) = \sum_{i=k}^{k+P} \left|d(i)\right|$$

is formed for each value of k between N−80 and N+79, where N is the sample number at the centre of the current frame. Thus, for each candidate pitch value P₁, P₂, P₃, P₄, P₅, P₆ a respective set of 160 summations is generated, each summation in the set starting at a different position in the frame.

If a pitch candidate is close to the actual pitch value, there should be little or no variation between the summations of the corresponding set. However, if the candidate and actual pitch values are very different (e.g. if the candidate pitch value is half the actual pitch value) there will be significant variation between the summations of the set. In order to detect for any such variation, the summations of each set are high-pass filtered and the sum of the squares of the resultant high-pass filtered values is used to evaluate a second estimate Met2. A small offset value is added to reduce pitch multiple errors when the speech is extremely periodic. A respective second estimate Met2(1), Met2(2), Met2(3), Met2(4), Met2(5), Met2(6) is evaluated for each of the candidate pitch values P₁, P₂, P₃, P₄, P₅, P₆ selected using the first estimate. Clearly, the smaller the value of Met2 the more likely it is that the corresponding pitch candidate is the actual pitch value. In the case of pitch halving, the value of Met2 will be large and this facilitates the elimination of this unwanted pitch candidate.
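A sketch of the Met2 computation follows; the text does not specify the high-pass filter or the size of the offset, so a first difference and a small constant are used as stand-ins:

```python
import numpy as np

def met2(d_abs, p, centre, offset=1e-3):
    """Second pitch measure for candidate period p: energy of the
    high-pass filtered pitch-period sums f(k, P) over 160 start points.

    `d_abs` holds |d(i)| for the buffered input; `centre` is the sample
    number N at the middle of the current frame."""
    p = int(round(p))
    starts = np.arange(centre - 80, centre + 80)      # k = N-80 .. N+79
    f = np.array([np.sum(d_abs[k:k + p + 1]) for k in starts])
    hp = np.diff(f)                  # first difference as a high-pass filter
    return float(np.sum(hp ** 2)) + offset
```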

Optionally, the input samples for the current frame may be autocorrelated in block 412 with a view to further improving the reliability of the first and second estimates Met1 and Met2. The normalised autocorrelations are examined to find the two highest values (V₁, V₂), and the corresponding lags L₁, L₂ (expressed as a number of samples) between consecutive occurrences of those values are also determined. If the ratio between V₁ and V₂ exceeds a preset threshold value (typically about 1.1), then the confidence is high that the values L₁, L₂ are close to the correct pitch value. If so, the values of Met1 and Met2 for candidate pitch values which come close to L₁ or L₂ are multiplied by respective weighting factors b₂ and b₃ to improve their chances of selection in the final estimation of pitch value.

The values of Met1 and Met2 are further weighted in block 413 according to a tracked pitch value, trP. Provided the current frame contains speech, i.e. if E_(O) > 1.5 E_(BG)^(n), the value of trP is updated using the pitch value estimated for the immediately preceding frame, the extent of the up-date being greater for higher values of speech energy. The ratio $\gamma = \frac{P - trP}{trP}$

is then evaluated for each candidate pitch value P₁, P₂, P₃, P₄, P₅, P₆.

In this example, if γ is less than 0.5, i.e. the candidate pitch value is close to the tracked pitch value estimated from the pitch values of earlier frames, the respective values of Met1 and Met2 are multiplied by further weighting factors b₄ and b₅ respectively. The values of b₄ and b₅ depend upon the level of background noise in the frame. If this is determined to be relatively high, e.g. $\frac{NRGS}{NRGB} < 10$,

b₄ is set at 1.25 and b₅ is set at 0.85. However, if γ < 0.3 (i.e. the candidate pitch value is even closer to the tracked value) b₄ is set at 1.56 and b₅ is set at 0.72. If it is determined that there is no significant background noise, e.g. $\frac{NRGS}{NRGB} > 10$,

the extent of the bias is reduced: if γ < 0.5, b₄ is set at 1.1 and b₅ is set at 0.9, and for γ < 0.3, b₄ is set at 1.21 and b₅ is set at 0.8.

The weighted values of Met2 are then used to discard any candidate pitch value which is clearly unpromising. To this end, the weighted values of Met2 are analysed in block 414 to detect for the minimum value, and if any other value exceeds this minimum by more than a preset factor (e.g. 2.0) plus a constant (e.g. 0.1) it is discarded along with the corresponding values of Met1(ω_(o)) and P.

As already described, if the pitch candidate is close to the correct value, Met1 will be very large and Met2 will be very small; therefore, a ratio derived from Met1 and Met2 provides a very sensitive measure of the correctness or otherwise of the pitch candidates.

Accordingly, in block 415, the ratio
$$R = \frac{Met'1}{\left(Met'2\right)^{0.25}},$$

where Met′1 and Met′2 are the weighted values of Met1 and Met2, is evaluated for each of the remaining pitch candidates, and the candidate pitch value corresponding to the maximum ratio R is selected as the estimated pitch value P_(o) for the current frame. A check is then made to confirm that the estimated pitch value P_(o) is not a submultiple of the actual pitch value. To this end, the ratio $S_{m} = \frac{P_{o}}{P_{n}}$

is calculated for each remaining candidate pitch value P_(n) and, provided this ratio is close to an integer greater than 1 (e.g. within 0.3 of that integer), P_(o) is confirmed in block 416 as the estimated pitch value for the frame.
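The candidate elimination and the ratio test of blocks 414 and 415 might be sketched as follows; the submultiple confirmation of block 416 is left to the caller, and the example values are invented:

```python
import numpy as np

def select_pitch(candidates, met1_w, met2_w):
    """Candidate maximising R = Met'1 / (Met'2)^0.25, after discarding
    candidates whose weighted Met2 is clearly too large (block 414)."""
    met2_min = min(met2_w)
    keep = [i for i, m2 in enumerate(met2_w) if m2 <= 2.0 * met2_min + 0.1]
    ratios = [met1_w[i] / (met2_w[i] ** 0.25) for i in keep]
    return candidates[keep[int(np.argmax(ratios))]]

# Example with six candidates and their weighted measures.
cands = [30.0, 45.5, 60.0, 91.0, 120.0, 140.0]
m1 = [4.1, 2.2, 8.9, 3.0, 1.2, 0.7]
m2 = [0.50, 0.60, 0.30, 0.65, 2.50, 3.00]
p_o = select_pitch(cands, m1, m2)    # -> 60.0 for these values
```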

The pitch algorithm described in detail with reference to FIG. 4 is extremely robust and involves the combination of both frequency and time domain techniques to eliminate pitch doubling and pitch halving.

Although the pitch value P_(o) is estimated to an accuracy within 0.5 samples or 1 sample, depending on the range within which the candidate value falls, this accuracy may not be sufficient for the processing which needs to be carried out in subsequent stages of the encoder, and so better accuracy is needed. Therefore, a refined pitch value is estimated in pitch refinement block 19.

To facilitate this, a second discrete Fourier transform is performed in DFT block 20, again using a 512 point fast Fourier transformation algorithm. As described earlier, samples were supplied to DFT block 17 via a 221 point Kaiser window 18. This window is too wide for the processing techniques that are now required, and so a narrower window is needed. Nevertheless, the window should still be at least three pitch periods wide. Therefore, the input samples are supplied to DFT block 20 via a variable length window 21 which is sensitive to the pitch value P_(o) detected in pitch detector 16. In this example, three different window sizes are used, 221, 181 and 161, respectively corresponding to the ranges P_(o) > 70, 70 > P_(o) ≥ 55 and 55 > P_(o). Again, these are Kaiser windows centred on the current frame.

The pitch refinement block 19 generates a new set of candidate pitch values containing fractional values distributed to either side of the estimated pitch value P_(o). In this embodiment, a total of 50 such candidate pitch values (including P_(o)) is used. A new value of Met1 is then computed for each of these candidate pitch values, and the candidate pitch value giving the maximum value of Met1 is selected as the refined pitch value P_(ref) upon which all subsequent processing will be based.

The new values of Met1 are computed in pitch refinement block 19 using substantially the same process as that described earlier with reference to FIG. 4, but with certain important modifications. Firstly, the magnitudes M(i) are calculated for the entire frequency spectrum generated by DFT block 20, instead of only for the low frequency range of the spectrum (i.e. values of i up to Cut−1). Secondly, the summation expressed in Equation 1 above is performed in two parts: a first (low frequency) part for values of kω_(o) up to 1.5 kHz (corresponding to i = 90), and a second (high frequency) part for the remaining values of kω_(o), and these two parts of the summation are weighted by different factors, 0.25 and 1.0 respectively.

As already described, the estimated pitch value P_(o) was based on an analysis of the low frequency range only, and so any inaccuracy in this estimate is largely attributable to the effect of the higher frequencies which were excluded from the analysis. In order to rectify this omission, the higher frequencies are included in the analysis carried out in block 19, and their effect is emphasised by the relative magnitudes of the weighting factors applied to the respective parts of the summation. Furthermore, the bias originally applied to the magnitude values M(i) in block 404, which had the (now unwanted) effect of emphasising the lower frequencies, is omitted from the analysis, and consequently the value M_(max) (originally evaluated in block 402) is not required either.

The refined pitch value P_(ref) generated in block 19 is passed to vector quantiser 22 where it is quantised to generate the pitch quantisation index P.

In this embodiment, the pitch quantisation index P is defined by seven bits (corresponding to 128 levels), and the vector quantiser 22 is an exponential quantiser to take account of the fact that the human ear is less sensitive to pitch inaccuracies at larger pitch values. The quantised pitch levels L_(p)(i) are defined as
$$L_{p}(i) = 15\left(\frac{150}{15}\right)^{\frac{i}{127}}, \quad \text{for } 0 \leq i \leq 127.$$
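A sketch of the exponential quantiser follows, transcribing the level formula directly; nearest-level rounding is an assumption:

```python
import numpy as np

def pitch_levels(bits=7, p_min=15.0, p_max=150.0):
    """Exponentially spaced levels L_p(i) = 15 * (150/15)^(i/127)."""
    i = np.arange(2 ** bits)
    return p_min * (p_max / p_min) ** (i / (2.0 ** bits - 1))

def quantise_pitch(p_ref):
    """7-bit index of the level nearest the refined pitch value."""
    return int(np.argmin(np.abs(pitch_levels() - p_ref)))

index_p = quantise_pitch(73.4)   # example refined pitch of 73.4 samples
```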

It will be appreciated that at a sampling rate of 8 kHz, up to 80 harmonic frequencies may be contained within the 4 kHz bandwidth of the DFT block 20. Clearly, a very large number of bits would be needed to encode all these harmonics individually, and this is not practicable in a speech encoder for which a relatively low bit rate is required. A more economical encoding model is needed.

As will now be described with reference to FIG. 5, the actual frequency spectrum derived from DFT block 20 is analysed in a voicing block 23 to set a voicing cut-off frequency F_(c) which divides the spectrum into two parts: a voiced part below the voicing cut-off frequency F_(c), which is the periodic component of speech, and an unvoiced part, which is the random component of speech.

Once the voiced and unvoiced parts of the spectrum have been separated in this way, they can be independently processed in the decoder without the need to generate and transmit information about the voiced/unvoiced status of each individual harmonic band.

Each harmonic band is centred on a multiple k of a fundamental frequency ω_(o), given by $\frac{2\pi}{P_{ref}}$.

Initially, the shape of each harmonic band is correlated with the ideal harmonic shape for the band (assuming it to be voiced) given by the Fourier transform of the selected variable length window 21. This is done by generating a correlation function S₁ for each harmonic band. For the k^(th) harmonic band,
$$S_{1}(k) = \sum_{a=a_{k}}^{b_{k}} M(a)\,W(m), \quad \rightarrow Eq\ 2$$

where M(a) is the complex value of the spectrum at position a in the FFT,

a_(k) and b_(k) are the limits of the summation for the band, and

W(m) is the corresponding magnitude of the ideal harmonic shape for the band, derived from the selected window, m being an integer defining the position in the ideal harmonic shape corresponding to the position a in the actual harmonic band, which is given by the expression:
$$m = \mathrm{integer}\left(Sbt \cdot \left(a - k\frac{SF}{P_{ref}}\right)\right), \quad \rightarrow Eq\ 3$$

where SF is the size of the FFT and Sbt is an up-sampling ratio, i.e. the ratio of the number of points in the window to the number of points in the FFT.

In addition to S₁, two normalisation functions S₂ and S₃ are generated, where
$$S_{2}(k) = \sum_{a=a_{k}}^{b_{k}}\left[M(a)\right]^{2},$$

and${{S_{3}(k)} = {\sum\limits_{a = a_{k}}^{a = b_{k}}\left\lbrack {W(m)} \right\rbrack^{2}}},$

These three functions S₁(k), S₂(k) and S₃(k) are then combined to generate a normalised correlation function V(k) given by
$$V(k) = \frac{S_{1}^{2}(k)}{S_{2}(k) \cdot S_{3}(k)},$$

where k is the number of the harmonic band. V(k) is further biased by raising it to the power of $1 + \frac{3(k - 10)}{40}$.

If there is exact correlation between the actual and the ideal harmonic shapes, the value of V(k) will be unity. FIG. 5 shows the form of a typical normalised correlation function V(k) for the case of a frequency spectrum for which the total number K of harmonic bands is 25 (i.e. k = 1 to 25). As shown in this Figure, the harmonic bands at the low frequency end of the spectrum are relatively close to unity and are therefore likely to be voiced.
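A per-band sketch of the correlation and bias follows; the alignment of W(m) via Equation 3 is simplified here to a direct slice, which is an approximation:

```python
import numpy as np

def band_voicing(m, w_shape, a_k, b_k, k):
    """Biased normalised correlation V(k) for the k-th harmonic band.

    `m` is the magnitude spectrum; `w_shape` the ideal harmonic shape
    W(m), assumed pre-aligned with the band (Eq 3 alignment omitted)."""
    band = m[a_k:b_k + 1]
    shape = w_shape[:len(band)]
    s1 = np.sum(band * shape)                  # Eq 2
    s2 = np.sum(band ** 2)
    s3 = np.sum(shape ** 2)
    v = (s1 * s1) / (s2 * s3)                  # unity for a perfect match
    return v ** (1.0 + 3.0 * (k - 10) / 40.0)  # bias applied to V(k)
```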

In order to set a value for F_(c), the function V(k) is compared with a corresponding threshold function THRES(k) at each value of k. The form of a typical threshold function THRES(k) is also shown in FIG. 5.

In order to compute THRES(k) the following values are used:

E−lf, E−hf, tr−E−lf, tr−E−hf, ZC, L₁, L₂, PKY1, PKY2, T₁ and T₂. These are defined as follows:
$$E{-}lf = \sum_{i=0}^{SF/2-1} M^{2}(i), \qquad E{-}hf = \sum_{i=SF/2}^{SF-1} M^{2}(i)$$

If (E_(o)^(n) < 2 E_(BG)^(n)) and the frame counter is less than 20,

tr^(n)−E−lf = 0.9 tr^(n−1)−E−lf + 0.1 E^(n)−lf, and

tr^(n)−E−hf = 0.9 tr^(n−1)−E−hf + 0.1 E^(n)−hf.

Otherwise, if (E_(o)^(n) < 1.5 E_(BG)^(n)),

tr^(n)−E−lf = 0.97 tr^(n−1)−E−lf + 0.03 E^(n)−lf, and

tr^(n)−E−hf = 0.97 tr^(n−1)−E−hf + 0.03 E^(n)−hf.

Also, the initial values are tr^(0)−E−hf = 10⁸,

and tr^(0)−E−lf = 10⁷.

ZC is set to zero, and for each i between −N/2 and N/2

ZC = ZC + 1 if ip[i] × ip[i−1] < 0,

where ip is input speech referenced so that ip[0] corresponds to the input sample lying in the centre of the window used to obtain the spectrum for the current frame.
$$L_{1} = \frac{1}{N}\sum_{i=-N/2}^{N/2-1}\left|\mathrm{residual}(i)\right|, \quad \text{and}$$
$$L_{2} = \left[\frac{1}{N}\sum_{i=-N/2}^{N/2-1}\left(\mathrm{residual}(i)\right)^{2}\right]^{1/2},$$

where residual(i) is an LPC residual signal generated at the output of an LPC inverse filter 28, and referenced so that residual(0) corresponds to ip(0).

PKY1=L2/L1

and PKY2 = L2′/L1′,

where L1′, L2′ are calculated as for L1, L2 respectively, but excluding a predetermined number of values to either side of the maximum residual value, averaged over a correspondingly reduced number of terms. PKY1 and PKY2 are both indications of the "peakiness" of the residual speech, but PKY2 is less sensitive to exceptionally large peaks.
$$T_{1} = \sum_{i=-N/2}^{N/2-1}\left|ip[i] - ip[i-1]\right|, \qquad T_{2} = \sum_{i=-N/2}^{N/2-1}\left|ip[i]\right|$$

If (NRGS < 30 × NRGB), i.e. noisy background conditions prevail, and if (E−lf > tr−E−lf) and (E−hf > tr−E−hf), then a low-to-high frequency energy ratio (LH−Ratio) is given by the expression
$$LH{-}Ratio = \frac{E{-}lf - 0.9\,tr{-}E{-}lf}{E{-}hf - 0.9\,tr{-}E{-}hf},$$

and if (E−lf<tr−E−lf), then

LH−Ratio=0.02,

and if E−hf<tr−E−hf, then

LH−Ratio=1.0,

and LH−Ratio is clamped between 0.02 and 1.0.

In these noisy background conditions, two different situations exist; namely, case 1, where the threshold value THRES(k) in the immediately preceding frame lay below the cut-off frequency F_(c) for that frame, and case 2, where the threshold value THRES(k) in the immediately preceding frame lay above the cut-off frequency F_(c) for that frame.

If (LH−Ratio<0.2), then for Case 1,

THRES(k) = 1.0 − ½(1.0 − (1/π)(k−1)ω_(o)), and for Case 2

THRES(k) = 1.0 − ⅓(1.0 − (1/π)(k−1)ω_(o)), and these values are then modified as follows:

THRES(k)=1.0−(1.0−THRES(k))(LH−Ratio×5)^(½).

If LH−Ratio>0.2, then for Case 1,

THRES(k) = 1.0 − ½(1.0 − (1/π)(k−1)ω_(o) × 0.125), and for Case 2,

THRES(k) = 1.0 − ⅓(1.0 − (1/π)(k−1)ω_(o) × 0.125), and if

(LH−Ratio≧1.0) these values are modified as follows:

THRES(k)=1−(1−THRES(k))^(½).

Defining an energy ratio
$$ER = 2.0\,\frac{E_{0}}{E_{0} + E_{\max}},$$

where E_(0) is the energy of the entire frequency spectrum, given by
$$E_{0} = \sum_{i=0}^{SF-1}\left(M(i)\right)^{2}$$

and E_max is an estimate of the maximum energy encountered in recent frames (where ER is set at 0.1 if ER < 0.1), then if (ER < 0.4), the above threshold values are further modified as follows:

THRES(k)=1.0−(1.0−THRES(k)) (2.5 ER)^(½), and

if (ER>0.6), the threshold values are further modified as follows:

THRES(k) = 1.0 − (1.0 − THRES(k))^(½).

Furthermore, if (THRES(k) > 0.85), these modified values are subjected to a yet further modification as follows:

THRES(k)=0.85+½(THRES(k)−0.85).

Finally, if ¾K ≤ k ≤ K, then the values of THRES(k) are modified still further as follows:

THRES(k) = 1.0 − ½(1.0 − THRES(k)).

In clean background conditions (i.e. NRGS > 30.0 NRGB), for Case 1,

THRES(k) = 1.0 − 0.6(1.0 − (1/π)(k−1) × 0.25),

and for Case 2,

THRES(k) = 1.0 − 0.45(1.0 − (1/π)(k−1) × 0.25).

These values then undergo successive modifications according to the following conditions:

(i) if (E−lf/E−hf<2.0), then

THRES(k) = 1 − (1 − THRES(k))(E−lf/(2.0 E−hf))

(ii) if (T₂/T₁<1), then

THRES(k) = 1 − (1 − THRES(k))(T₂/T₁)²

(iii) if (T₂/T₁>1.5), then

THRES(k)=1−(1−THRES(k))^(½),

(iv) if (ZC>60), then

THRES(k) = 1 − (1 − THRES(k))(60/ZC)²

(v) if (ER<0.4), then

THRES(k)=1−2.5 ER (1−THRES(k))

(vi) if (ER>0.6), then

THRES(k) = 1 − (1 − THRES(k))^(½), and finally

(vii) if (THRES(k)>0.5), then

THRES(k)=1−1.6 (1−THRES(k)), otherwise

THRES(k) = 0.4 THRES(k).

The input speech is low-pass filtered and the normalised cross-correlation is then computed for integer lag values P_(ref)−3 to P_(ref)+3, and the maximum value of the cross-correlation, CM, is determined.

The values of THRES(k) derived above for noisy and clean background conditions are then further modified according to the first condition to be satisfied in the following hierarchy of conditions:

1. If (PKY1>1.8) and (PKY2>1.7),

THRES(k)=0.5 THRES(k).

2. If (PKY1>1.7) and (CM>0.35),

THRES(k)=0.45 THRES(k).

3. If (PKY1>1.6) and (CM>0.2),

THRES(k)=0.55 THRES(k).

4. If (CM>0.85) or (PKY1>1.4 and CM>0.5) or (PKY1>1.5 and CM>0.35),

THRES(k)=0.75 THRES(k).

5. If (CM<0.55) and (PKY1<1.25),

THRES(k) = 1 − 0.25(1 − THRES(k)).

6. If (CM<0.7) and PKY1<1.4,

THRES(k)=1−0.75 (1−THRES(k)).

Finally, if (E−OR > 0.7) and (ER < 0.11), or if (ZC > 90), a further modification is made, where E−OR is the ratio of residual to input-speech energy:
$$E{-}OR = \frac{\sum_{i=-N/2}^{N/2-1} \mathrm{residual}^{2}(i)}{\sum_{i=-N/2}^{N/2-1} ip^{2}(i)}$$

A summation S_(v) is then formed as follows:

S_(v) = Σ_(k=1)^(K) (V(k)−THRES(k))(2t_(voice)(k)−1) × B(k)

where B(k) = 5S₃(k) if V(k) > THRES(k), otherwise B(k) = S₃(k), and t_(voice)(k) takes either the value "1" or the value "0".

In effect, the values t_(voice)(k) define a trial voicing cut-off frequency F_(c) such that t_(voice)(k) is "1" at all values of k below F_(c) and is "0" at all values of k above F_(c). FIG. 5 shows a first set of values t¹_(voice)(k) defining a first trial cut-off frequency F¹_(c), and a second set of values t²_(voice)(k) defining a second trial cut-off frequency F²_(c). In this embodiment, the summation S_(v) is formed for each of eight different sets of values t¹_(voice)(k), t²_(voice)(k) . . . t⁸_(voice)(k), each defining a different trial cut-off frequency F¹_(c), F²_(c) . . . F⁸_(c). The set of values giving the maximum summation S_(v) will determine the voicing cut-off frequency for the frame.

It will be appreciated that the effect of the function (2t_(voice)(k)−1) in the above summation is to reverse the sign of the difference value (V(k)−THRES(k)) whenever t_(voice)(k) has the value "0", i.e. at values of k above the cut-off frequency. In the example shown in FIG. 5, the effect of the function (2t_(voice)(k)−1) is to determine whether the voicing cut-off frequency F_(c) should be set at a value F¹_(c) which is below dip D in the correlation function V(k) or at a higher value F²_(c) above the dip. In the range of k referenced N in FIG. 5, the value V(k) is less than the value THRES(k) and so the difference value (V(k)−THRES(k)) in the summation S_(v) is negative. If the first set of values t¹_(voice)(k) is used, their effect is to reverse the sign of (V(k)−THRES(k)) in the range N, resulting in a positive contribution to the overall summation.

In contrast, if the second set of values t²_(voice)(k) is used, their effect is to maintain unchanged the sign of (V(k)−THRES(k)) in the range N, resulting in a negative contribution to the overall summation. In the range of k referenced P in FIG. 5, the opposite will be the case; that is, the first set of values t¹_(voice)(k) will result in a negative contribution to the summation for the range, whereas the second set of values t²_(voice)(k) will result in a positive contribution to the summation. However, as will be apparent from the relative areas of the respective cross-hatched regions in FIG. 5, the effect of the difference values (V(k)−THRES(k)) in range N is much greater than in range P and so, in this example, the first set of values t¹_(voice)(k) will give the maximum summation S_(v), and would be used to determine the voicing cut-off frequency (F¹_(c)) for the frame.
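A sketch of the search over the eight trial cut-off frequencies follows; how the eight trial frequencies are spaced is not stated in the text, so uniform spacing across the K bands is an assumption:

```python
import numpy as np

def choose_cutoff(v, thres, s3, n_trials=8):
    """Index (1..8) of the trial cut-off maximising
    S_v = sum_k (V(k) - THRES(k)) * (2*t_voice(k) - 1) * B(k)."""
    K = len(v)
    diff = np.asarray(v) - np.asarray(thres)
    b = np.where(diff > 0.0, 5.0 * np.asarray(s3), np.asarray(s3))  # B(k)
    best_sv, best_trial = -np.inf, 1
    for trial in range(1, n_trials + 1):
        cutoff = int(round(trial * K / n_trials))  # bands below are voiced
        t_voice = (np.arange(K) < cutoff).astype(int)
        sv = float(np.sum(diff * (2 * t_voice - 1) * b))
        if sv > best_sv:
            best_sv, best_trial = sv, trial
    return best_trial      # the 3-bit voicing quantisation index V
```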

Having selected a value of F_(c) from the eight possible values, the corresponding index (1 to 8) provides the voicing quantisation index V which is routed to a third output O₃ of the encoder via voicing quantiser 24. The quantisation index V is defined by three bits corresponding to the eight possible frequency levels.

Having established values for pitch, P_(ref), and voicing cut-off frequency, F_(c), for the current frame, the spectral amplitude of each harmonic band is evaluated in amplitude determination block 25. The spectral amplitudes are derived from a frequency spectrum produced by performing a discrete Fourier transform in block 27 (implemented as a Fast Fourier Transform) on a windowed LPC residual signal generated at the output of LPC inverse filter 28. Filter 28 is supplied with the original input speech signal and with a set of regenerated LPC coefficients generated by dequantising the LSF quantisation indices in LSF dequantiser 29 and transforming the dequantised LSF values in an LSF-LPC transformer 30.

If an harmonic band (the k^(th) band, say) lies in the unvoiced part of the frequency spectrum, that is, it lies above the voicing cut-off frequency F_(c), the spectral amplitude amp(k) of the band is given by the RMS energy in the band, expressed as
$$amp(k) = \left[\frac{\sum_{a=a_{k}}^{b_{k}} M_{r}(a)^{2}}{b_{k} - a_{k}}\right]^{1/2}\beta,$$

where M_(r)(a) is the complex value at position a in the frequency spectrum derived from the LPC residual signal, calculated as before from the real and imaginary parts of the FFT, a_(k) and b_(k) are the limits of the summation for the k^(th) band, and β is a normalisation factor which is a function of the window.

If, on the other hand, the harmonic band lies in the voiced part of the frequency spectrum, that is, it lies below the voicing cut-off frequency F_(c), the spectral amplitude amp(k) for the k^(th) band is given by the expression
$$amp(k) = \left[\frac{\sum_{a=a_{k}}^{b_{k}} M_{r}(a)\,W(m)}{\sum_{a=a_{k}}^{b_{k}}\left[W(m)\right]^{2}}\right]^{1/2}$$

where W(m) is as defined with reference to Equations 2 and 3 above.

The spectral amplitudes obtained in this way are normalised to have unity mean.
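The two amplitude formulas can be sketched per band as follows; the alignment of W(m) is again simplified to a direct slice, and the normalisation factor β is left as a parameter:

```python
import numpy as np

def band_amplitude(m_r, w_shape, a_k, b_k, voiced, beta=1.0):
    """Spectral amplitude amp(k) of one harmonic band of the residual
    spectrum M_r: RMS energy if unvoiced, correlation fit if voiced."""
    band = np.abs(m_r[a_k:b_k + 1])
    if not voiced:                    # band above F_c
        return beta * np.sqrt(np.sum(band ** 2) / (b_k - a_k))
    shape = w_shape[:len(band)]       # ideal harmonic shape W(m)
    return np.sqrt(np.sum(band * shape) / np.sum(shape ** 2))
```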

The normalised spectral amplitudes are then quantised in amplitude quantiser 26. It will be appreciated that this may be done using a variety of different quantisation schemes depending upon the number of available bits. In this particular embodiment, a vector quantisation process is used and reference is made to the LPC frequency spectrum P(ω) for the frame. The LPC frequency spectrum P(ω) represents the frequency response of the LPC filter 12 and has the form
$$P(\omega) = \frac{1}{1 - \sum_{l=1}^{L} LPC(l)\,e^{-j\omega l}}$$

where LPC(l) are the LPC coefficients. In this embodiment there are 10 LPC coefficients, i.e. L = 10.

The LPC frequency spectrum P(ω) is shown in FIG. 6(a) and the corresponding spectral amplitudes amp(k) are shown in FIG. 6(b). In this example, only 10 harmonic bands (k = 1 to 10) are shown.

The LPC frequency spectrum is examined to find four harmonic bands containing the highest magnitudes and, in this illustration, these are the harmonic bands for which k = 1, 2, 3 and 5. As illustrated in FIG. 6(c), the corresponding spectral amplitudes amp(1), amp(2), amp(3), amp(5) form the first four elements V(1), V(2), V(3), V(4) of an eight element vector, and the last four elements of the vector (V(5) to V(8)) are formed from the six remaining spectral amplitudes, amp(4) and amp(6) to amp(10), by appropriate averaging. To this end, element V(5) is formed by amp(4), element V(6) is formed by the average of amp(6) and amp(7), element V(7) is formed by amp(8) and element V(8) is formed by the average of amp(9) and amp(10).
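A sketch of the vector assembly follows; the pairwise grouping of the remaining amplitudes is an illustrative rule and differs slightly from the specific grouping in the worked example above:

```python
import numpy as np

def assemble_amplitude_vector(amps, lpc_band_magnitudes):
    """8-element vector: the amplitudes of the four strongest LPC-spectrum
    bands first, then averages of the remaining amplitudes."""
    order = np.argsort(lpc_band_magnitudes)[::-1]
    top4 = sorted(order[:4])                       # e.g. k = 1, 2, 3, 5
    rest = [k for k in range(len(amps)) if k not in top4]
    v = [float(amps[k]) for k in top4]
    for group in np.array_split(np.array(rest), 4):
        v.append(float(np.mean([amps[k] for k in group])))
    return np.array(v)

# Example matching FIG. 6: ten bands, bands k = 1, 2, 3 and 5 strongest
# (0-based indices 0, 1, 2 and 4).
amps = np.array([1.8, 1.5, 1.2, 0.4, 0.9, 0.35, 0.3, 0.45, 0.25, 0.2])
lpc_mags = np.array([9.0, 8.0, 7.0, 1.0, 6.0, 0.9, 0.8, 1.1, 0.7, 0.6])
vec = assemble_amplitude_vector(amps, lpc_mags)
```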

The vector quantisation process is carried out with reference to the entries in a codebook, and the entry which best matches the assembled vector (using a mean squared error measure weighted by the LPC spectral shape) is selected as the first part S1 of an amplitude quantisation index S for the frame.

In addition, a second part S2 of the amplitude quantisation index S is computed as the RMS energy R_(m) of the original speech input of the frame.

The first part of the amplitude quantisation index S1 represents the "shape" of the frequency spectrum, whereas the second part of the amplitude quantisation index S2 represents the scale factor related to the volume of the speech signal. In this embodiment, the first part of the index S1 consists of 6 bits (corresponding to a codebook containing 64 entries, each representing a different spectral "shape") and the second part of the index S2 consists of 5 bits. The two parts S1, S2 are combined to form an 11 bit amplitude quantisation index S which is forwarded to a fourth output O₄ of the encoder.

Depending upon the number of available bits, a variety of different schemes can be used to quantise the spectral amplitude. For example, the quantisation codebook could contain a larger or smaller number of entries, and each entry may comprise a vector consisting of a larger or smaller number of amplitude values.

As will be described hereinafter, the decoder operates on the indices S, P and V to synthesise the residual signal whereby to generate an excitation signal which is supplied to the decoder LPC synthesis filter.

In summary, the encoder generates a set of quantisation indices L, P, V, S1 and S2 for each frame of the input speech signal.

The encoder bit rate depends upon the number of bits used to define the quantisation indices and also upon the update rate of the quantisation indices.

In the described example, the update period for each quantisation index is 20 ms (the same as the frame update period) and the bit rate is 2.4 kb/s. The number of bits used for each quantisation index in this example is summarised in Table 1 below.

TABLE 1

  BIT RATE (kb/s)        2.4    1.2     3.9      4.0      5.2      6.8
  FRAME PERIOD (ms)      20     40      20       20       20       20
  UPDATE PERIOD (ms)     20     20      10/10    10/10    10/10    10/10
  NO. OF BITS:
    LPC                  24     4+24    28       20/20    28       28/28
    P                    7      7       7/5      7/5      7/5      7/7
    V                    3      3       4/4      3/3      4/4      5/5
    S1                   6      0       8/8      6/6      21/21    21/21
    S2                   5      5/5     7/7      5/5      7/7      7/7
  NO. OF BITS/FRAME      45*    48      78       80       104      136

Where two values are shown as first/second, the index is updated twice per frame and the two values give the bits used in the first and second update periods respectively; at 1.2 kb/s the S2 index is determined once for each half of the 40 ms frame.

*Three additional bits (giving a total of 48 bits) can either be used for better quantisation of parameters or for synchronisation and error protection.

Table 1 also summarises the distribution of bits amongst the quantisation indices in each of five further examples, in which the speech encoder operates at 1.2 kb/s, 3.9 kb/s, 4.0 kb/s, 5.2 kb/s and 6.8 kb/s respectively.

In some of these examples, some or all of the quantisation indices are updated at 10 ms intervals, i.e. twice per frame. It will be noted that in such cases the pitch quantisation index P derived during the first 10 ms update period in a frame may be defined by a greater number of bits than the pitch quantisation index P derived during the second 10 ms update period. This is because the pitch value derived during the first update period is used as a basis for the pitch value derived during the second update period, and so the latter pitch value can be defined using fewer bits.

In the case of the 1.2 kb/s rate, the frame length is 40 ms. In this case, the pitch and voicing quantisation indices P, V are determined for one half of each frame, and the indices for the other half of the frame are obtained by extrapolation from the respective parameters in adjacent half frames.

The LSF coefficients (LSF2, LSF3) for the leading and trailing halves of the current 40 ms frame are quantised with reference to each other and with reference to the LSF coefficients (LSF1) for the trailing half of the immediately preceding frame and the corresponding LSF quantisation vector.

Target quantised LSF coefficients (LSF′1, LSF′2, LSF′3) for each half frame are given by the sum of a respective prediction value (P1, P2, P3) for that half frame and a respective LSF quantisation vector (Q1, Q2, Q3) contained in a vector quantisation codebook, where

$LSF'_1 = P_1 + Q_1,$

$LSF'_2 = P_2 + Q_2,$ and

$LSF'_3 = P_3 + Q_3.$

Each prediction value P2, P3 is obtained from the respective LSF quantisation vector Q1, Q2 for the immediately preceding half frame, such that:

$P_2 = \lambda Q_1,$ and

$P_3 = \lambda Q_2,$

where λ is a constant prediction factor, typically in the range from 0.5 to 0.7.
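
The prediction step is simple enough to state in code; a minimal sketch, assuming length-10 numpy vectors for the codebook entries:

```python
import numpy as np

LAMBDA = 0.6  # prediction factor, within the stated 0.5-0.7 range

def target_quantised_lsf(Q1, Q2, Q3):
    """Form the target quantised LSF vectors for the two halves of the
    current frame from codebook vectors Q1 (trailing half of the
    previous frame), Q2 and Q3."""
    P2 = LAMBDA * Q1          # prediction from the preceding half frame
    P3 = LAMBDA * Q2
    return P2 + Q2, P3 + Q3   # LSF'2, LSF'3
```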

To reduce the bit rate, it is useful to define the target quantised LSF coefficients LSF′2 (for the leading half of the current frame) in terms of the target quantised LSF coefficients (LSF′1, LSF′3) for the adjacent half frames. Thus,

$LSF'_2 = \alpha LSF'_1 + (1 - \alpha)LSF'_3,$  (Eq. 4)

where α is a vector of 10 elements in a sixteen-entry codebook represented by a 4-bit index.

By substitution of the foregoing equations it can be shown that

$LSF'_3\,(1 - \lambda + \lambda\alpha) = Q_3 + \lambda\alpha LSF'_1 - \lambda^2 Q_1.$  (Eq. 5)

The only variables in equations 4 and 5 above are the vectors α and Q3, and these vectors are varied to minimise an error function ε (which may be perceptually weighted) given by

$\varepsilon = (LSF'_3 - LSF_3)^2 + (LSF'_2 - LSF_2)^2,$

which represents a measure of distortion between the actual and quantised LSF coefficients in the current frame.

The respective codebooks are searched to discover the combination of vectors α and Q3 giving the minimum error function ε, and the selected entries in the codebooks respectively define 4-bit and 24-bit components of a 28-bit LSF quantisation index for the current frame. In a manner similar to that described earlier with reference to the 2.4 kb/s encoder, the LSF quantisation vectors contained in the vector quantisation codebook consist of three groups, each containing 2⁸ entries numbered 1 to 256, which correspond to the first three, the second three and the last four LSF coefficients. The selected entry in each group defines an eight-bit quantisation index, giving a total of 24 bits for the three groups.
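
The joint search can be sketched as an exhaustive loop over the two codebooks, using Eq. 5 (rearranged for LSF′3) and Eq. 4. This is illustrative only: the codebook arrays are placeholders, and a single Q3 codebook stands in for the three 8-bit groups described above.

```python
import numpy as np

def joint_lsf_search(LSF2, LSF3, LSF1_q, Q1, alpha_book, Q3_book, lam=0.6):
    """Exhaustive search for the alpha vector (4-bit codebook) and LSF
    quantisation vector Q3 minimising the unweighted error function.
    All vectors have 10 elements; operations are elementwise."""
    best = (0, 0, np.inf)
    for ia, alpha in enumerate(alpha_book):
        for iq, Q3 in enumerate(Q3_book):
            # Eq. 5 rearranged for the quantised trailing-half LSFs
            LSF3_q = (Q3 + lam * alpha * LSF1_q - lam ** 2 * Q1) \
                     / (1.0 - lam + lam * alpha)
            # Eq. 4: quantised leading-half LSFs from the neighbours
            LSF2_q = alpha * LSF1_q + (1.0 - alpha) * LSF3_q
            err = np.sum((LSF3_q - LSF3) ** 2 + (LSF2_q - LSF2) ** 2)
            if err < best[2]:
                best = (ia, iq, err)
    return best  # (4-bit alpha index, Q3 index, minimum error)
```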

The speech coder described with reference to FIGS. 3 to 6 may operate at a single bit rate. Alternatively, the speech coder may be an adaptive multi-rate (AMR) coder selectively operable at any one of two or more different bit rates. In a particular implementation of this, the AMR coder is selectively operable at any one of the aforementioned bit rates where, again, the distribution of bits amongst the quantisation indices for each rate is summarised in Table 1.

The quantisation indices generated at outputs O₁, O₂, O₃ and O₄ of the speech encoder are transmitted over the communications channel to the decoder, shown in FIG. 7. In the decoder the quantisation indices are regenerated and are supplied to inputs I₁, I₂, I₃ and I₄ of dequantisation blocks 30, 31, 32 and 33 respectively.

Dequantisation block 30 outputs a set of dequantised LSF coefficients for the frame and these are used to regenerate a corresponding set of LPC coefficients which are supplied to an LPC synthesis filter 34.

Dequantisation blocks 31, 32 and 33 respectively output dequantised values of pitch (P_(ref)), voicing cut-off frequency (F_(c)) and spectral amplitude (amp(k)) together with the RMS energy R_(m), and these values are used to generate an excitation signal E_(x) for the LPC synthesis filter 34. To this end, the values P_(ref), F_(c), amp(k) and R_(m) are supplied to a first excitation generator 35 which synthesises the voiced part of the excitation signal (i.e. the part containing frequencies below F_(c)) and to a second excitation generator 36 which synthesises the unvoiced part of the excitation signal (i.e. the part containing frequencies above F_(c)).

The first excitation generator 35 generates a respective sinusoid at the frequency of each harmonic band; that is, at integer multiples of the fundamental pitch frequency ${\omega_{o} = \frac{2\pi}{P_{ref}}}$

up to the voicing cut-off frequency F_(c). To this end, the first excitation generator 35 generates a set of sinusoids of the form A_(k)cos(kθ), where k is an integer.

Using the dequantised pitch value (P_(ref)), the beginning and end of each pitch cycle within the synthesis frame are determined, and for each pitch cycle a new set of parameters is obtained by interpolation.

The phase θ(i) at any sample i is given by the expression

$\theta(i) = \theta(i-1) + 2\pi\left[\omega_{last}(1-x) + \omega_{o} \cdot x\right],$

where ω_(last) is the fundamental pitch frequency determined for the immediately preceding frame, and $x = \frac{k}{F},$

where F is the total number of samples in a frame, and k is the sample position of the middle of the current pitch cycle being synthesised in the current frame.
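
A minimal sketch of this phase recursion follows; for brevity x is updated every sample rather than being held at the midpoint of each pitch cycle, which is a simplification of the text's scheme.

```python
import numpy as np

def voiced_phase_track(theta0, w_last, w_o, F):
    """Accumulate the per-sample phase
    theta(i) = theta(i-1) + 2*pi*[w_last*(1-x) + w_o*x],  x = i/F,
    blending the previous frame's fundamental w_last into the current
    frame's fundamental w_o across the F samples of the frame."""
    theta = np.empty(F)
    theta[0] = theta0
    for i in range(1, F):
        x = i / F  # simplification: per-sample x, not the cycle midpoint
        theta[i] = theta[i - 1] + 2.0 * np.pi * (w_last * (1.0 - x) + w_o * x)
    return theta
```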

The term ω_(last)(1−x)+ω_(o)·x in the above expression causes a progressive shift in the phase, pitch cycle-by-pitch cycle, to ensure a smooth phase transition at the frame boundaries. The amplitude A_(k) of each sinusoid is related to the product amp(k)·R_(m) for the current frame; however, interpolation between the amplitudes of the current and immediately preceding frames, carried out on a pitch cycle-to-pitch cycle basis, may be applied, as follows (the three rules are sketched in code after the list):

(i) If an harmonic frequency band lies in the unvoiced part of the frequency spectrum in the current frame but lay in the voiced part of the frequency spectrum in the immediately preceding frame, it is assumed that the speech signal is tailing off. In this case, a sinusoid is still generated by excitation generator 35 for the current frame, but using the amplitude of the earlier frame, scaled down by a suitable ramping factor (which is preferably held constant over each pitch cycle) over the length of the current frame.

(ii) If an harmonic frequency band lies in the voiced part of the frequency spectrum in the current frame but lay in the unvoiced part of the frequency spectrum in the immediately preceding frame, it is assumed that there is an onset in the speech signal. In this case, the amplitude of the current frame is used, but scaled up by a suitable ramping factor (which, again, is preferably held constant over each pitch cycle) over the length of the frame.

(iii) If an harmonic frequency band lies in the voiced part of the frequency spectrum in both the current and the immediately preceding frames, normal speech is assumed. In this case, the amplitude is interpolated between the current and previous amplitude values over the length of the current frame.
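
A sketch of the three rules for a single harmonic band follows; the function name, the voicing flags and the ramp parameter (0 at the start of the frame, 1 at its end, held constant over each pitch cycle) are illustrative conventions rather than the text's own.

```python
def harmonic_amplitude(A_prev, A_curr, voiced_prev, voiced_curr, ramp):
    """Per-pitch-cycle amplitude for one harmonic band."""
    if voiced_prev and not voiced_curr:      # (i) tail-off
        return A_prev * (1.0 - ramp)         # earlier amplitude, ramped down
    if voiced_curr and not voiced_prev:      # (ii) onset
        return A_curr * ramp                 # current amplitude, ramped up
    if voiced_curr and voiced_prev:          # (iii) normal speech
        return A_prev * (1.0 - ramp) + A_curr * ramp
    return 0.0                               # band unvoiced in both frames
```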

Alternatively, voiced part synthesis can be implemented by an inverse DFT method, where the DFT size is equal to the interpolated pitch length. In each pitch cycle the input to the DFT consists of the decoded and interpolated spectral amplitudes up to the point of the interpolated cut-off frequency F_(c), and zeros thereafter.

The second excitation generator 36 used to synthesise the unvoiced part of the excitation signal includes a random noise generator which generates a white noise sequence. An “overlap and add” technique is used to extract from this sequence a series of P_(ref) samples corresponding to the current interpolated pitch cycle. This is accomplished using a trapezoidal window having an overall width of 256 samples and which is slid along the white noise sequence, frame-by-frame, in steps of 160 samples. The windowed samples are subjected to a 256-point fast Fourier transform and the resultant frequency spectrum is shaped by the dequantised spectral amplitudes. In the frequency range above F_(c), each harmonic band, k, in the frequency spectrum is shaped by the dequantised and scaled spectral amplitude R_(m)amp(k) for the band, and in the frequency range below F_(c) (which corresponds to the voiced part of the spectrum) the amplitude of each harmonic band is set to zero. An inverse Fourier transform is then applied to the shaped frequency spectrum to produce the unvoiced excitation signal in the time domain. The samples corresponding to the current pitch cycle are then used to form the unvoiced excitation signal. The use of an “overlap and add” technique enhances the smoothness of the decoded speech signal.
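
The core of this path can be sketched as follows. The taper width of the trapezoidal window and the treatment of each FFT bin as one harmonic band are simplifying assumptions, and the cross-frame overlap-and-add accumulation is omitted.

```python
import numpy as np

def unvoiced_excitation(noise, start, amp, Rm, k_cutoff, frame_len=160):
    """Window 256 noise samples, transform, zero the voiced bands
    (k <= k_cutoff), shape the unvoiced bands by Rm*amp(k), invert."""
    w = np.ones(256)                      # trapezoidal window, 256 wide
    w[:48] = np.linspace(0.0, 1.0, 48)    # taper width is an assumption
    w[-48:] = np.linspace(1.0, 0.0, 48)
    seg = noise[start:start + 256] * w    # window slides 160 samples/frame
    spec = np.fft.rfft(seg, 256)
    shaped = np.zeros_like(spec)
    for k in range(1, min(len(spec), len(amp))):
        if k > k_cutoff:                  # above F_c: shape by Rm*amp(k)
            shaped[k] = spec[k] * Rm * amp[k]
    return np.fft.irfft(shaped, 256)[:frame_len]
```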

The voiced excitation signal generated by the first excitation generator 35 and the unvoiced excitation signal generated by the second excitation generator 36 are added together in adder 37 and the combined excitation signal E_(x) is output to the LPC synthesis filter 34. The LPC synthesis filter 34 receives interpolated LPC coefficients derived from the decoded LSF coefficients and uses these to filter the combined excitation signal to synthesise the output speech signal S_(o)(t).

In order to generate a smooth output speech signal S_(o)(t) any change in the LPC coefficients should be gradual, and so interpolation is desirable. It is not possible to interpolate between LPC coefficients directly; however, it is possible to interpolate between LSF coefficients.

If consecutive frames are completely filled with speech so that the RMS energies in the frames are substantially the same, the two sets of LSF coefficients for the frames are not too dissimilar and so a linear interpolation can be applied between them. However, a problem would arise if a frame contains speech and silence; that is, the frame contains a speech onset or a speech tail-off. In this situation, the LSF coefficients for the current frame and the LSF coefficients for the immediately preceding frame would be very different, and so a linear interpolation would tend to distort the true speech pattern, resulting in noise.

In the case of a speech onset, the RMS energy E_(c) in the current frame is greater than the RMS energy E_(p) in the immediately preceding frame, whereas in the case of speech tail-off the reverse is true.

With a view to alleviating this problem an energy-dependent interpolation is applied. FIG. 8 shows the variation of the interpolation factor across the frame for different ratios $\frac{E_{p}}{E_{c}}$

ranging from 0.125 (speech onset) to 8.0 (speech tail-off). It can be seen from FIG. 8 that the effect of the energy-dependent interpolation factors is to impose a bias toward the more significant set of LSF coefficients, so that voiced parts of the frame are not passed through a filter more appropriate to background noise.
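
One simple realisation of such an energy-dependent bias is to raise the linear interpolation factor to a power controlled by the ratio E_(p)/E_(c); this sketch reproduces the qualitative behaviour described for FIG. 8, not its exact curves.

```python
import numpy as np

def interpolate_lsf(lsf_prev, lsf_curr, Ep, Ec, n=160):
    """Interpolate two 10-element LSF sets across an n-sample frame,
    biased toward the higher-energy frame's coefficients."""
    t = np.arange(1, n + 1) / n              # linear factor across frame
    ratio = np.clip(Ep / max(Ec, 1e-12), 0.125, 8.0)
    g = t[:, None] ** ratio
    # small ratio (onset): g rises quickly, favouring the louder current
    # frame; large ratio (tail-off): g stays low, favouring the louder
    # previous frame
    return (1.0 - g) * lsf_prev + g * lsf_curr
```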

The interpolation procedure is applied to the LSF coefficients in LSF Interpolator 38 and the interpolated values so obtained are passed to an LSF-LPC Transformer 39 where the corresponding LPC coefficients are generated.

In order to enhance speech quality it has been customary, hitherto, to perform post-processing on the synthesised output speech signal to reduce the effect of noise in the valleys of the LPC frequency spectrum, where the LPC model of speech is relatively poor. This can be accomplished using suitable filters; however, such filtering induces some spectral tilt which muffles the final output signal and so reduces speech quality.

In this embodiment, a different technique is used; more specifically, instead of processing the output of the LPC synthesis filter 34, as has been done in the past, the technique used in this embodiment relies on weighting the spectral amplitudes generated at the output of decoder block 33. The weighting factor Q(kω_(o)) applied to the k^(th) spectral amplitude is derived from the LPC spectrum P(ω) described earlier. LPC spectrum P(ω) is peak-interpolated to generate a peak-interpolated spectrum H(ω), and the weighting function Q(ω) is given by the ratio of P(ω) and H(ω), raised to the power λ; that is: $Q(\omega) = \left\lbrack \frac{P(\omega)}{H(\omega)} \right\rbrack^{\lambda},$

where λ is in the range from 0.0 to 1.0 and is preferably 0.35.
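
Evaluated at the harmonic frequencies, the weighting reduces to a few lines. Here P and H are assumed to be the LPC spectrum and its peak-interpolated envelope sampled on a common frequency grid, with k_indices giving the grid positions of the harmonics; these conventions are illustrative.

```python
import numpy as np

def perceptual_weights(P, H, k_indices, lam=0.35):
    """Weighting factors Q(k*w_o) = [P(w)/H(w)]**lam at the harmonic
    frequencies; below 1 in spectral valleys, near 1 at the peaks."""
    Q = (np.asarray(P) / np.asarray(H)) ** lam
    return Q[np.asarray(k_indices)]   # weights applied to amp(k)
```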

The functions P(ω) and H(ω) are shown in FIG. 9 along with the perceptually-enhanced LPC spectrum given by Q(ω)P(ω).

As can be seen from this Figure, the effect of the weighting function Q(ω) is to reduce the value of the LPC spectrum in the valley regions between peaks, and so reduce the noise in these regions. When the appropriate weights Q(kω_(o)) are applied to the dequantised spectral amplitudes amp(k) in perceptual weighting block 40, their effect is to improve the quality of the output speech signal, as though it had been subjected to post-processing, but without causing the spectral tilt and associated muffling of the post-processing technique used in the past.

Since the output of the LPC synthesis filter 34 can fluctuate in energy, the output is preferably controlled. This is done in two stages, using the optional circuit shown in broken outline in FIG. 7. In the first stage, the actual pitch cycle energy is computed in block 41 and this energy is compared with the desired interpolated pitch cycle energy in a ratioing circuit 42 to generate a ratio value. The corresponding pitch cycle of the excitation signal E_(x) is then multiplied by this ratio value in multiplier 43 to reduce a difference between the compared energies, and then passed to a further LPC synthesis filter 44 which synthesises the smoothed output speech signal.
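
The scaling stage can be sketched as follows. Taking the square root of the energy ratio (so that the scaled cycle's energy lands near the target) is an assumption about the intended arithmetic, and the energy is measured on the excitation cycle itself rather than at the filter output as the text describes.

```python
import numpy as np

def scale_pitch_cycle(ex_cycle, desired_energy):
    """Scale one pitch cycle of the excitation so that its energy
    approaches the desired interpolated pitch-cycle energy, before
    re-filtering with the further LPC synthesis filter (not shown)."""
    actual = float(np.sum(ex_cycle ** 2))    # actual cycle energy
    ratio = np.sqrt(desired_energy / max(actual, 1e-12))  # assumption: sqrt
    return ex_cycle * ratio
```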

What is claimed is:
 1. A speech coder including an encoder for encoding an input speech signal divided into frames each consisting of a predetermined number of digital samples, the encoder including: linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame; pitch determination means for determining at least one value of pitch for each frame, the pitch determination means including first estimation means for analysing samples using a frequency domain technique (frequency domain analysis), second estimation means for analysing samples using a time domain technique (time domain analysis) and pitch evaluation means for using the results of said frequency domain and time domain analyses to derive a said value of pitch; voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of linear prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein said first estimation means generates a first measure of pitch for each of a number of candidate pitch values, the second estimation means generates a respective second measure of pitch for each of said candidate pitch values and said evaluation means combines each of at least some of the first measures with the corresponding said second measure and selects one of the candidate pitch values by reference to the resultant combinations.
 2. A speech coder as claimed in claim 1, wherein said evaluation means forms said combinations by forming a ratio from each said first measure and the corresponding second measure and selects said one candidate pitch value by reference to the ratios so formed.
 3. A speech coder as claimed in claim 1, wherein the evaluation means compares each said candidate pitch value with a tracked pitch value derived from one or more earlier frames and weights the corresponding said first and second measures by respective amounts in dependence on the comparison before said measures are combined.
 4. A speech coder as claimed in claim 3 wherein the amounts of the weighting depend also on the level of background noise in the current frame.
 5. A speech coder as claimed in claim 1 wherein said first estimation means generates a first frequency spectrum for each frame, identifies peaks in the first frequency spectrum, subjects the first frequency spectrum to a smoothing process to generate a smoothed frequency spectrum and for each candidate pitch value correlates peaks identified in said first frequency spectrum with amplitudes at different harmonic frequencies (kω_(o)) in the smoothed frequency spectrum to generate a respective said first measure of the pitch value, where ${\omega_{o} = \frac{2\pi}{P}},$

P is the candidate pitch value and k is an integer.
 6. A speech coder as claimed in claim 5 wherein prior to identification of said peaks, magnitude values forming said first frequency spectrum are compared with an RMS value for the spectrum and are weighted in dependence on the comparison whereby to de-emphasise a peak having a magnitude greater than said RMS value.
 7. A speech coder as claimed in claim 6 wherein said magnitude values are further weighted by a factor which increases as a function of decreasing frequency.
 8. A speech coder as claimed in claim 7 wherein the magnitudes of said first frequency spectrum are adjusted to take account of background noise in the current frame.
 9. A speech coder as claimed in claim 5 wherein prior to correlation, the magnitude of each peak identified in the first frequency spectrum is compared with the corresponding magnitude in the smoothed frequency spectrum and is either discarded or retained in dependence on the comparison.
 10. A speech coder as claimed in claim 1 wherein said first estimation means selects a single candidate pitch value for each of a preset number of frequency bands, and said second estimation means generates a said second measure of pitch for each of the candidate pitch values selected by the first estimation means.
 11. A speech coder as claimed in claim 1 wherein said selected candidate pitch value provides an estimation of said value of pitch and the said evaluation means includes pitch refinement means for determining the value of pitch from the estimate.
 12. A speech coder as claimed in claim 11, wherein the pitch refinement means defines a set of further candidate pitch values including fractional values distributed about said estimate, generates a further frequency spectrum for the frame, identifies peaks in the further frequency spectrum, subjects said further frequency spectrum to a smoothing process to generate a further smoothed frequency spectrum, for each further candidate pitch value correlates peaks identified in the further frequency spectrum with amplitudes at different harmonic frequencies (kω_(o)) in the smoothed frequency spectrum, wherein ${\omega_{o} = \frac{2\pi}{P}},$

P is a said further candidate pitch value and k is an integer, and selects as the value of pitch for the frame the further candidate pitch value giving the maximum correlation.
 13. A speech coder as claimed in claim 1 wherein said pitch determination means determines a first value of pitch for a leading part of each frame and a second value of pitch for a trailing part of each frame, and said quantisation means quantises both said values of pitch.
 14. A speech coder as claimed in any one of claims 1 to 13 wherein said voicing means determines for each frame at least one voicing cut-off frequency for separating a frequency spectrum from the frame into a voiced part and an unvoiced part, and wherein said amplitude determination means generates spectral amplitudes for each frame in response to a said voicing cut-off frequency and a said value of pitch determined by the voicing means and the pitch determination means respectively.
 15. A speech coder as claimed in claim 14, wherein for each frame said voicing means performs the following steps: (i) derives a voicing measure for each frequency band harmonically related to a said pitch value determined by the pitch determination means, (ii) compares the voicing measure for each harmonic frequency band with a threshold value to generate a comparison value which may be a positive value or a negative value, (iii) biasses each comparison value by an amount which reverses the sign of the comparison value if the corresponding harmonic frequency band lies above a trial cut-off frequency, (iv) sums the biassed comparison values over several harmonic frequency bands in the frame, (v) repeats steps (i) to (iv) above for a plurality of different trial cut-off frequencies, and (vi) selects as a voicing cut-off frequency for the frame the trial cut-off frequency giving the maximum summation.
 16. A speech coder as claimed in claim 15, wherein said voicing measure is formed by correlating the shape of said harmonic frequency band with a reference shape for the band.
 17. A speech coder as claimed in claim 16 including means for applying a window function to the input speech signal and deriving from the windowed input speech signal said frequency spectrum containing said harmonic frequency bands, and wherein said reference shape is derived from said window function.
 18. A speech coder as claimed in claim 14 wherein said voicing means determines a first said voicing cut-off frequency for a leading part of each frame and a second said voicing cut-off frequency for a trailing part of each frame.
 19. A speech coder as claimed in claim 15 wherein said threshold value is dependent on the level of a background component in the input speech signal.
 20. A speech coder as claimed in claim 19 wherein said voicing means evaluates an estimate of said threshold value in dependence on said level of a background component, modifies the estimate according to the value of one or more of E−lf/E−hf, T₂/T₁, ZC or ER as hereinbefore defined and further modifies the estimate according to the value of one or more of PKY1, PKY2, CM and E−OR as hereinbefore defined.
 21. A speech coder as claimed in claim 1 wherein said amplitude determination means generates, for each frame, a set of spectral amplitudes for different frequency bands centred on frequencies harmonically related to a said value of pitch determined by the pitch determination means, and said quantisation means quantises the spectral amplitudes to generate a first part of an amplitude quantisation index.
 22. A speech coder as claimed in claim 1 further including a decoder, comprising means for decoding the quantisation indices generated by a said encoder and means for processing the decoded quantisation indices to generate a sequence of digital signals representing the input speech signal.
 23. A speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of linear prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein said pitch determination means includes pitch estimation means for determining an estimate of the value of pitch and pitch refinement means for deriving the value of pitch from the estimate, the pitch refinement means defining a set of candidate pitch values including fractional values distributed about said estimate of the value of pitch determined by the pitch estimation means, identifying peaks in a frequency spectrum of the frame, for each said candidate pitch value correlating said peaks with amplitudes at different harmonic frequencies (kω_(o)) of a frequency spectrum of the frame, where ${\omega_{o} = \frac{2\pi}{P}},$

P is a said candidate pitch value and k is an integer, and selecting as a said value of pitch for the frame the candidate pitch value giving the maximum correlation.
 24. A speech coder as claimed in claim 23 wherein said pitch estimation means includes first estimation means for analysing samples using a frequency domain technique (frequency domain analysis), second estimation means for analysing samples using a time domain technique (time domain analysis) and means for deriving said estimate of the value of pitch from the results of said time and frequency domain analyses.
 25. A speech coder as claimed in claim 23 wherein the pitch refinement means correlates the amplitudes of said peaks with amplitudes at harmonic frequencies (kω_(o)) of an exponentially decaying envelope of the frequency spectrum in which the peaks were identified.
 26. A speech coder as claimed in claim 23 wherein said voicing means determines for each frame at least one voicing cut-off frequency for separating a frequency spectrum from the frame into a voiced part and an unvoiced part, and wherein said amplitude determination means generates spectral amplitudes in response to said voicing cut-off frequency and said value of pitch determined by the voicing means and the pitch determination means respectively.
 27. A speech coder as claimed in claim 26, wherein for each frame said voicing means performs the following steps: (i) derives a voicing measure for each frequency band harmonically related to said pitch value determined by the pitch determination means, (ii) compares the voicing measure for each harmonic frequency band with a threshold value to generate a comparison value which may be a positive value or a negative value, (iii) biasses each comparison value by an amount which reverses the sign of the comparison value if the corresponding harmonic frequency band lies above a trial cut-off frequency, (iv) sums the biassed comparison values over several harmonic frequency bands in the frame, (v) repeats steps (i) to (iv) above for a plurality of different trial cut-off frequencies, and (vi) selects as a voicing cut-off frequency for the frame the trial cut-off frequency giving the maximum summation.
 28. A speech coder as claimed in claim 27 wherein said voicing measure is formed by correlating the shape of said harmonic frequency band with a reference shape for the band.
 29. A speech coder as claimed in claim 28 including means for applying a window function to the input speech signal and deriving from the windowed input speech signal a frequency spectrum containing said harmonic frequency bands, and wherein said reference shape is derived from said window function.
 30. A speech coder as claimed in claim 26 wherein said voicing means generates a first said voicing cut-off frequency for a leading part of each frame and a second said voicing cut-off frequency for a trailing part of each frame.
 31. A speech coder as claimed in claim 27 wherein said threshold value is dependent on the level of a background component in the input speech signal.
 32. A speech coder as claimed in claim 23 wherein said amplitude determination means generates, for each frame, a set of spectral amplitudes for different frequency bands centred on frequencies harmonically related to a value of pitch determined by the pitch determination means and said quantisation means quantises the spectral amplitudes to generate a first part of an amplitude quantisation index.
 33. A speech coder as claimed in claim 23 wherein said pitch determination means determines a first value of pitch for a leading part of each frame and a second value of pitch for a trailing part of each frame, and said quantisation means quantises both said values of pitch.
 34. A speech coder as claimed in claim 23 further including a decoder, comprising means for decoding the quantisation indices generated by a said encoder and means for processing the decoded quantisation indices to generate a sequence of digital signals representing the input speech signal.
 35. A speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames, each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for determining for each frame a voicing cut-off frequency for separating a frequency spectrum from the frame into a voiced part and an unvoiced part without evaluating the voiced/unvoiced status of individual harmonic frequency bands, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of coefficients, said value of pitch, said voicing cut-off frequency and said amplitude information to generate a set of quantisation indices for each frame.
 36. A speech coder as claimed in claim 35, wherein for each frame said voicing means performs the following steps: (i) derives a voicing measure for each frequency band harmonically related to said pitch value determined by the pitch determination means, (ii) compares the voicing measure for each harmonic frequency band with a threshold value to generate a comparison value which may be a positive value or a negative value, (iii) biasses each comparison value by an amount which reverses the sign of the comparison value if the corresponding harmonic frequency band lies above a trial cut-off frequency, (iv) sums the biassed comparison values over several harmonic frequency bands in the frame, (v) repeats steps (i) to (iv) above for a plurality of different trial cut-off frequencies, and (vi) selects as a voicing cut-off frequency for the frame the trial cut-off frequency giving the maximum summation.
 37. A speech coder as claimed in claim 36 wherein said voicing measure is formed by correlating the shape of each harmonic frequency band with a reference shape for the band.
 38. A speech coder as claimed in claim 37 including means for applying a window function to the input speech signal and deriving from the windowed input speech signal a frequency spectrum containing said harmonic frequency bands, and wherein said reference shape is derived from said window function.
 39. A speech coder as claimed in claim 36 wherein said threshold value is dependent on the level of a background component in the input speech signal.
 40. A speech coder as claimed in claim 35 wherein said voicing means determines a first voicing cut-off frequency for a leading part of each frame and a second voicing cut-off frequency for a trailing part of each frame, and said quantisation means quantises both said values of voicing cut-off frequency.
 41. A speech coder as claimed in claim 35 further including a decoder, comprising means for decoding the quantisation indices generated by a said encoder and means for processing the decoded quantisation indices to generate a sequence of digital signals representing the input speech signal.
 42. A speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding (LPC) means for analysing samples and generating at least one set of linear prediction coefficients for each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said set of prediction coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices for each frame, wherein the amplitude determination means generates, for each frame, a set of spectral amplitudes for frequency bands centred on frequencies harmonically related to the value of pitch determined by the pitch determination means, and the quantisation means quantises the normalised spectral amplitudes to generate a first part of an amplitude quantisation index.
 43. A speech coder as claimed in claim 42, wherein the spectral amplitudes for each frame are derived from an LPC residual signal for the frame.
 44. A speech coder as claimed in claim 42, wherein the spectral amplitudes for each frame are quantised by reference to an LPC frequency spectrum derived from prediction coefficients for the frame.
 45. A speech coder as claimed in claim 42 further including a decoder, comprising means for decoding the quantisation indices generated by a said encoder and means for processing the decoded quantisation indices to generate a sequence of digital signals representing the input speech signal.
 46. A speech coder as claimed in claim 42 including a decoder comprising means for decoding the quantisation indices generated by a said encoder and processing means for processing the decoded quantisation indices to generate a sequence of digital samples representing the input speech signal, wherein the processing means includes means for weighting the decoded spectral amplitudes derived from said first part of the amplitude quantisation index by weighting factors derived from the ratio of an LPC frequency spectrum derived from the decoded prediction coefficients and a corresponding peak-interpolated LPC frequency spectrum.
 47. A speech coder including an encoder for encoding an input speech signal, the encoder comprising means for sampling the input speech signal to produce digital samples and for dividing the samples into frames each consisting of a predetermined number of samples, linear predictive coding means for analysing samples to generate a respective set of Line Spectral Frequency (LSF) coefficients for a leading part and for a trailing part of each frame, pitch determination means for determining at least one value of pitch for each frame, voicing means for defining a measure of voiced and unvoiced signals in each frame, amplitude determination means for generating amplitude information for each frame, and quantisation means for quantising said sets of LSF coefficients, said value of pitch, said measure of voiced and unvoiced signals and said amplitude information to generate a set of quantisation indices, wherein said quantisation means defines a set of quantised LSF coefficients (LSF′2) for the leading part of the current frame by the expression LSF′2=αLSF′1+(1−α)LSF′3, where LSF′3 and LSF′1 are respectively sets of quantised LSF coefficients for the trailing parts of the current frame and the frame immediately preceding the current frame, and α is a vector in a first vector quantisation codebook, defines each said set of quantised LSF coefficients LSF′2, LSF′3 for the leading and trailing parts respectively of the current frame as a combination of respective LSF quantisation vectors Q2, Q3 of a second vector quantisation codebook and respective prediction values P2, P3, where P2=λQ1 and P3=λQ2, λ is a constant and Q1 is a said LSF quantisation vector for the trailing part of said immediately preceding frame, and selects said vector Q3 and said vector α from the first and second vector quantisation codebooks respectively to minimise a measure of distortion between the LSF coefficients generated by the linear predictive coding means (LSF2, LSF3) for the current frame and the corresponding quantised LSF coefficients (LSF′2, LSF′3).
 48. A speech coder as claimed in claim 47 wherein said second vector quantisation codebook contains at least two groups of said vectors with reference to which respective groups of LSF coefficients in a set are quantised.
 49. A speech coder as claimed in claim 47 wherein said measure of distortion is an error function ε = W₁(LSF′3−LSF3)² + W₂(LSF′2−LSF2)², where W₁ and W₂ are perceptual weights.
 50. A speech coder as claimed in claim 47 further including a decoder, comprising means for decoding the quantisation indices generated by a said encoder and means for processing the decoded quantisation indices to generate a sequence of digital signals representing the input speech signal.
 51. A speech coder for decoding a set of quantisation indices representing LSF coefficients, pitch value, a measure of voiced and unvoiced signals and amplitude information, including processor means for deriving an excitation signal from said indices representing pitch value, measure of voiced and unvoiced signals and amplitude information, an LPC synthesis filter for filtering the excitation signal in response to said LSF coefficients, means for comparing pitch cycle energy at the LPC synthesis filter output with corresponding pitch cycle energy in the excitation signal, means for modifying the excitation signal to reduce a difference between the compared pitch cycle energies and a further LPC synthesis filter for filtering the modified excitation signal.